Research iteration 2

Sjoerd Beetsma, Maarten de Jeu Class V2A - Group 5

Introduction

As mentioned in research iteration 1, we have divided our 3 research questions across 3 datasets. The first 2 research questions correspond to the chemical datasets, and the last question corresponds to the review dataset.

Throughout this notebook, we'll attempt to stick to the iterative CRISP-DM workflow as much as possible, and divide our notebook into chapters corresponding to phases from this philosophy. This chapter division is only a rough indication of what can be found where though, because the highly iterative workflow requires us to look back/forward in the process quite frequently.

Specifically, everything you're looking for can be found in this order:

  1. Business understanding: Chemical dataset.
  2. Data understanding: Chemical dataset.
  3. Data preparation: Chemical dataset.
  4. Modelling: Research question 1.
  5. Modelling: Research question 2.
  6. Business understanding: Review dataset.
  7. Data understanding: Review dataset.
  8. Data preparation: Review dataset.
  9. Modelling: Research question 3.

Business Understanding: Chemical dataset.

Our client has provided fairly little context for our research questions, but we'll be examining the following:

  1. Can we predict the quality of a red wine according to its chemical properties?
  2. Can we predict a wine's color based on its chemical properties?

Data Understanding: Chemical datasets.

The dataset for the first research question was acquired from https://archive.ics.uci.edu/ml/datasets/wine+quality. Alongside the dataset about red wine, there is also a dataset about white wine; the first will be used for our first research question, and the two combined will be used for our second research question.

The variables in the chemical datasets are:

1 - fixed acidity: Fixed acids found in wines are tartaric, malic, citric, and succinic.
2 - volatile acidity: Steam-distillable acids like acetic, lactic, formic, butyric, and propionic acids.
3 - citric acid: One of the fixed acids found in a wine.
4 - residual sugar: The natural grape sugars left over in a wine after fermentation is finished.
5 - chlorides: A major contributor to the saltiness of a wine.
6 - free sulfur dioxide: The part of the sulfur dioxide that remains unbound.
7 - total sulfur dioxide: Free sulfur dioxide plus bound sulfur dioxide.
8 - density: Density of a wine, usually increased by fermentable sugars.
9 - pH: Specifies the acidity/basicity of a wine; most wines range from pH 3 to 4.
10 - sulphates: Sulphates slow down the growth of yeasts and prevent oxidation, keeping the wine fresh for longer.
11 - alcohol: Alcohol percentage of a wine.
12 - quality: A quality score between 0 and 10, based on sensory data.

The source of the data also states that they don't know if all variables are relevant in deciding the quality score of a wine.

Source:

Paulo Cortez, University of Minho, Guimarães, Portugal, http://www3.dsi.uminho.pt/pcortez A. Cerdeira, F. Almeida, T. Matos and J. Reis, Viticulture Commission of the Vinho Verde Region(CVRVV), Porto, Portugal @2009

We import some libraries and the dataset to examine the data through code.

For research question one we will only need the red wine dataset. The second research question requires a combined dataset with labels added for the color of the wine (white or red). The two datasets will be explored and cleaned at the same time but separately, performing similar operations on both. Merging the two datasets before the data cleaning would result in records being wrongly designated as outliers, correlations going undetected in the individual datasets, etc. We'll merge the two cleaned datasets for use in the second research question.

Let's load in both red and white wine datasets
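A minimal sketch of this loading step (the filename is an assumption; the UCI wine-quality files use `;` as separator, and a tiny inline sample stands in for the real file so the sketch is self-contained):

```python
import io

import pandas as pd

# In the notebook this would be e.g. pd.read_csv('winequality-red.csv', sep=';')
# (filename assumed); the UCI wine-quality files are semicolon-separated.
# A tiny inline sample stands in for the real file here.
sample = io.StringIO(
    '"fixed acidity";"volatile acidity";"alcohol";"quality"\n'
    '7.4;0.70;9.4;5\n'
    '7.8;0.88;9.8;5\n'
)
dataset_red = pd.read_csv(sample, sep=';')
print(dataset_red.shape)  # (2, 4)
```

Loading the white wine file works identically, only with a different filename.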

Let's take the head of one of the chemical property datasets to have a first look at the records.

As described by the source, each row seems to correspond to an individual wine, with eleven feature columns describing the chemical properties of the wine and one target column, quality.

To access columns more easily in the future, we'll change the white space in column names to underscores.
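The rename could be sketched like this (the sample frame is hypothetical; the real notebook applies it to both wine datasets):

```python
import pandas as pd

# Hypothetical frame with the original UCI-style column names.
df = pd.DataFrame({'fixed acidity': [7.4], 'free sulfur dioxide': [11.0]})

# Replace the white space in column names with underscores.
df.columns = df.columns.str.replace(' ', '_')
print(list(df.columns))  # ['fixed_acidity', 'free_sulfur_dioxide']
```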

To see how many data entries we have, we'll check the number of rows in both the red and white datasets.


Target and feature variables

For research question 1, all the columns describing chemical properties will be considered feature variables, and the column quality represents the target variable: the variable we want to predict. Let's save them in variables to access them easily from the final dataframe.
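As a sketch (the frame below is a hypothetical stand-in for the cleaned red wine data):

```python
import pandas as pd

# Hypothetical stand-in for the cleaned red wine dataframe.
dataset_red = pd.DataFrame({
    'volatile_acidity': [0.70, 0.88, 0.76],
    'alcohol': [9.4, 9.8, 9.8],
    'quality': [5, 5, 5],
})

X = dataset_red.drop(columns='quality')  # feature variables: the chemical properties
y = dataset_red['quality']               # target variable
print(list(X.columns), y.name)
```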

The initial feature variables for research question 2 are the same as those for question 1, but they will change after some investigation. The target variable is currently not available as a column; the eventual target variable will be derived from which dataset a record currently resides in.

Scales of measurements

To choose an appropriate model for our research questions and available data, it's necessary to have an understanding of the scales of measurement of all the relevant variables.

As can be seen, all the chemical properties have a continuous scale of measurement, and the quality column has a discrete scale of measurement.

Central tendencies and dispersion measures

Using describe we can see some important central tendencies and dispersion measures of the dataset.

From the describe output we can tell that quite a few columns combine a big difference between maximum and minimum values with a low standard deviation, which indicates outliers in, for example, residual_sugar, chlorides, free_sulfur_dioxide and total_sulfur_dioxide.

Just like the red-wine dataset, the white-wine dataset has similar differences in maximum and minimum values.

Let's take a more visual look at the distribution of the data through a histogram for each of the feature attributes, starting with the red wine dataset.

In the red wine dataset, pH and possibly density have a Gaussian distribution. The other variables seem to have a skewed distribution.

Moving on to the white wine dataset:

In the white wine dataset, only pH seems to have a Gaussian distribution, though density might as well after removing outliers. The other variables have a skewed, possibly lognormal, distribution.

From the count plots above we can tell the quality of red wines ranges from 3 to 8, with 5 being the most common quality rating. For white wines it ranges from 3 to 9, with 6 being the most common quality rating.

Outliers

To get a visual understanding of the outliers in the feature columns, each feature is boxplotted against the research question 1 target variable, quality. This gives a small summary of the minimum, Q1, Q2 (median), Q3 and maximum of each attribute at each quality score, showing outliers at all quality levels.

First, boxplot all the feature variables on the y-axis against the target variable quality on the x-axis for the red wine dataset.

Now do the same for the white-wine dataset

As can be seen from the boxplots all of our current variables contain outliers.

All outliers in the above boxplots seem to be plausible and not caused by incorrect data, like some attributes in iteration 1 were. From the boxplot with alcohol on the y-axis and quality on the x-axis, we can see a trend of rising median alcohol percentage as the quality of the wine increases.

Correlations

For later models it's important to know which variables have a (linear) correlation with each other. To find linear correlations and their direction/strength we make use of Pearson's correlation coefficient.
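With pandas this boils down to `DataFrame.corr` (the sample values below are assumed, just to show the shape of the result):

```python
import pandas as pd

# Tiny frame with assumed values, to show the Pearson correlation matrix.
df = pd.DataFrame({
    'alcohol': [9.0, 10.0, 11.0, 12.0],
    'quality': [5, 5, 6, 7],
    'density': [0.998, 0.996, 0.995, 0.993],
})

# Pairwise Pearson coefficients: values in [-1, 1], with 1 on the diagonal.
corr = df.corr(method='pearson')
print(corr.round(2))
```

The resulting matrix is what gets drawn as a heatmap in the notebook.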

In the correlation matrices graphed above you can see which attributes correlate with other attributes. Starting with our research question 1 target variable, quality: it has a few correlations, the strongest being with alcohol for both red and white wines, plus a few weaker ones such as volatile acidity, sulphates and citric acid for red wines, and density and chlorides for white wines. Because quality is our research question 1 target variable, it's the dependent attribute in these correlations.

Quality is dropped for research question 2, so these correlations are not relevant there.

Besides this, there are some correlations among the chemical properties themselves: fixed acidity has a strong correlation with pH, but it's still an independent attribute; pH, however, is dependent on it. Volatile acidity, residual sugar, sulphates, chlorides, and density are all independent attributes. Total sulfur dioxide is dependent on free sulfur dioxide, while free sulfur dioxide is independent.

Strongly correlated features might need to be removed during the data preparation phase.

Data Preparation: Chemical dataset.

Let's start off the data preparation by checking the datatypes, and cleaning or changing them if necessary.

Red-wine quality datatypes:

White-wine quality datatypes:

These all seem to be in order, so we can now move on to checking and removing or replacing any NA values in the datasets.

But luckily, all data seems to be complete, so no need to worry about that either.

Removing outliers

A remove-outliers function is created and used to remove extreme outliers. This will make a regression algorithm perform better, because the RMSE is sensitive to outliers.

We remove all extreme outliers, leaving the mild ones in the dataset, using an outer fence: 3 IQR below Q1 and 3 IQR above Q3.
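A sketch of such a function, assuming a 3×IQR outer fence per column (the demo values are made up; the notebook's own implementation may differ in details):

```python
import pandas as pd

def remove_extreme_outliers(df, columns):
    """Drop rows outside the outer fence (Q1 - 3*IQR, Q3 + 3*IQR) in any
    of the given columns; mild outliers inside the fence are kept."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        iqr = q3 - q1
        mask &= df[col].between(q1 - 3 * iqr, q3 + 3 * iqr)
    return df[mask]

# Made-up column with one extreme value (0.90) well past the outer fence.
demo = pd.DataFrame({'chlorides': [0.05, 0.06, 0.07, 0.08, 0.90]})
print(len(remove_extreme_outliers(demo, ['chlorides'])))  # 4
```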

The red wine dataset contained 1599 rows and the white wine dataset 4898 before removing the outliers. Let's remove the outliers and check how many rows are left.

1435 of the 1599 rows are left in the red wine data: about 10% of the rows contained extreme outliers.
In the white wine data 4690 of the 4898 rows are left: about 4% of the rows contained extreme outliers.

Duplicates

Duplicate values would probably cause overfitting in the worst case, and imbalanced models in the best case. Let's check whether either dataset contains any.

Both contain duplicates that need to be removed, so let's get rid of them.
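In pandas this is essentially a one-liner (the sample frame is assumed):

```python
import pandas as pd

# Sample frame with one fully duplicated row.
df = pd.DataFrame({'alcohol': [9.4, 9.4, 10.1], 'quality': [5, 5, 6]})

print(df.duplicated().sum())  # 1 duplicate found
df = df.drop_duplicates()
print(len(df))                # 2 rows remain
```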

The red and white wine chemical property datasets are cleaned but still need normalizing. Because we only need normalized data for our second research question, not the first, we will only normalize the combined red and white dataset.

Now both the red and white datasets are ready to be combined into one dataset, with an extra column indicating whether the wine is red.
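The combination step could be sketched like this (the tiny frames stand in for the cleaned datasets; the 'is_red' column name matches the target variable used later):

```python
import pandas as pd

# Tiny stand-ins for the cleaned red and white wine frames.
dataset_red = pd.DataFrame({'chlorides': [0.076, 0.098]})
dataset_white = pd.DataFrame({'chlorides': [0.045, 0.049, 0.050]})

# Label each record with its color before stacking the frames.
dataset_red_white = pd.concat(
    [dataset_red.assign(is_red=1), dataset_white.assign(is_red=0)],
    ignore_index=True,
)
print(dataset_red_white['is_red'].tolist())  # [1, 1, 0, 0, 0]
```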

Normalizing data

Machine learning algorithms using a euclidean distance function benefit from working with normalized data.
We'll define a function that normalizes a pandas Series or DataFrame object. We won't use it for now, because it's best to only normalize when needed, but it's good to have.
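A minimal min-max version of such a function could look like this (a sketch; the notebook's own definition may differ):

```python
import pandas as pd

def normalize(data):
    """Min-max normalize a pandas Series or DataFrame to the [0, 1] range."""
    return (data - data.min()) / (data.max() - data.min())

print(normalize(pd.Series([2.0, 6.0, 10.0])).tolist())  # [0.0, 0.5, 1.0]
```

For a DataFrame, `data.min()` returns per-column minima and the subtraction broadcasts column-wise, so the same function covers both cases.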

Data cleaned

We now have the dataset_red, dataset_white, and dataset_red_white Dataframes, which are all cleaned. Nothing is normalized as of yet, but we have defined a procedure to do this with.

Modeling: Research question 1

Test and train data

The dataset will be split into a train and a test set for the models to learn from and to test their performance on.
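With scikit-learn this is `train_test_split` (synthetic data and an 80/20 split ratio are assumptions here):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the red wine features and quality target.
X, y = make_regression(n_samples=100, n_features=11, random_state=0)

# Hold out 20% of the rows for testing (split ratio assumed).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(len(X_train), len(X_test))  # 80 20
```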

Baseline model

We start out by creating a baseline model that always predicts the mean of the target variable in the training set, so that we can attempt to improve this score with other models. Let's check the RMSE of this model.
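The baseline boils down to predicting the training-set mean everywhere (the quality values below are made up; the notebook computes this on the real split):

```python
import numpy as np

# Made-up quality scores standing in for the real train/test targets.
y_train = np.array([5, 6, 5, 7, 5], dtype=float)
y_test = np.array([5, 6, 4], dtype=float)

# The baseline always predicts the mean of the training targets.
baseline_pred = np.full(len(y_test), y_train.mean())
rmse = np.sqrt(np.mean((y_test - baseline_pred) ** 2))
print(round(rmse, 2))  # 1.01
```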

The baseline model scores an RMSE of 0.81 quality points; this is the score we want to improve upon with further models.

Implementing a machine learning model

For our first machine learning model we will implement the simplest form of multiple-feature regression: multiple linear regression. This model doesn't have any hyper-parameters to tune, so instead of hyper-parameter tuning we use a feature-selection algorithm.

Recursive feature elimination

From our feature selection we can see big improvements going from just one feature up to the 6 best features. From 6 up to the total number of 11 features we only see a very small improvement. To keep the complexity of the model down and emphasize the most important features, we will use the 6 best ones according to the recursive feature elimination.

Let's check the 6 selected features.

Feature variables: volatile_acidity, chlorides, density, pH, sulphates and alcohol. These 6 variables will be used in our linear model.
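The elimination itself could be sketched with scikit-learn's `RFE` (synthetic data; 11 features with 6 informative ones mirror the setup described above):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: 11 features, 6 of them actually informative.
X, y = make_regression(n_samples=200, n_features=11, n_informative=6, random_state=0)

# Recursively drop the weakest feature until 6 remain.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=6).fit(X, y)
selected = [i for i, keep in enumerate(rfe.support_) if keep]
print(len(selected))  # 6
```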

Linear regression

Fit the model to the X_train and y_train data.

Get the mean_squared_error to see how good the prediction is vs the actual values.

Let's see how multiple linear regression compares to the baseline model. Get the RMSE to measure the performance of the linear model.
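A sketch of the fit-and-score step (synthetic data stands in for the 6 selected wine features):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 6 RFE-selected features and the quality target.
X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit on the training data, then score the predictions with the RMSE.
model = LinearRegression().fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(round(rmse, 2))
```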

The multiple linear regression scored an RMSE of 0.63; compared to our baseline this is about a 22% lower RMSE, meaning the baseline has already been improved upon and surpassed by the first model. But it's not quite as low as we want it yet. Let's try to improve further with polynomial regression, where we can see whether a higher polynomial degree results in better performance. A polynomial model of degree 1 is the same as linear regression.

For our hyper-parameter, the polynomial degree, we will try degrees 1 to 7.

Polynomial regression

Make the model

Train a model for every earlier-specified degree.

Let's see how well polynomial regression performs with different degrees. For polynomial regression we will also use the 6 RFE-selected features.
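One pipeline per degree is a common way to sketch this; synthetic quadratic data (an assumption, in place of the wine features) makes the effect of the degree visible:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic quadratic data; the notebook fits the RFE-selected wine features instead.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.1, size=200)

# Expand the features polynomially, then fit linearly, once per degree 1..7.
rmses = {}
for degree in range(1, 8):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X, y)
    rmses[degree] = np.sqrt(mean_squared_error(y, model.predict(X)))

print(round(rmses[1], 2), round(rmses[2], 2))  # degree 2 fits the quadratic far better
```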

Polynomial regression performed best with a polynomial degree of 2, but showed no notable improvement between degrees 1 and 3.

We have implemented multiple linear and polynomial regression in combination with recursive feature elimination. Our models performed better than the baseline, with about a 22% improvement in the measured RMSE.

Let's try 2 other regression algorithms, Lasso and Ridge regression.

Ridge regression will reduce the impact of features that are not important for predicting the target value. Lasso regression can eliminate features entirely, ignoring those without significant impact.

For Ridge regression we'll try alphas from 0 to 2 in steps of 0.2.

Ridge regression

Make and train the models with different alphas.

Test the models' performance by predicting on the test dataset and computing the RMSE from the predictions and the actual values.
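The fit-and-test loop could be sketched like this (synthetic data; the sketch starts at alpha 0.2 rather than 0, since alpha=0 is just plain linear regression):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data for the wine features and target.
X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One Ridge model per alpha; each is scored with the RMSE on the test set.
scores = {}
for alpha in np.arange(0.2, 2.2, 0.2):
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    scores[round(alpha, 1)] = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

best_alpha = min(scores, key=scores.get)  # alpha with the lowest RMSE
print(best_alpha)
```

The Lasso sweep in the next section follows the same pattern with `Lasso` and a different alpha range.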

Ridge regression has the best performance at an alpha of 0.4, measuring an RMSE of 0.63.

Lastly, let's try Lasso regression.

The alphas that will be tried range from 0.001 to 0.1 in steps of 0.01.

Lasso regression

Make the model

Train the model with the different alphas.

Get the RMSE to measure the performance of the Lasso model.

Lasso regression has the best RMSE score when using a very low alpha, in this case 0.0001. Lasso regression didn't improve upon Ridge regression.

Ridge regression results

Let's investigate the Ridge model more to see which coefficients had the most impact in the prediction.

Take out the best performing Ridge model.

Make a dataframe with the features and their corresponding coefficients, and sort them.

And plot them in a barplot to get a visual representation of the impact of each feature variable.

From the barplot it can be seen that chlorides and volatile_acidity have the most negative impact, and sulphates has by far the most positive impact.

As a reminder, the feature selection for multiple linear and polynomial regression arrived at the following 6 feature variables:
volatile_acidity, chlorides, density, pH, sulphates and alcohol.

While in the Ridge regression model the 6 most impactful feature variables are:
volatile_acidity, chlorides, citric_acid, pH, sulphates and alcohol.

Since the two most impactful feature variables from the Ridge regression, chlorides and volatile acidity, overlap with the features obtained from RFE, Ridge regression's predictions will be visualized.

Let's predict on the test data with the Ridge regression model.

By making a regression plot of the actual quality of the test set against the predicted quality we can see how well our model did. The tighter the spread of the dots around the line, the more accurately the model predicts.

To see the linear correlation in the model, the two most impactful feature variables will be plotted against the target variable. In our case chlorides and volatile acidity together have a strong negative impact on the target variable quality. Let's plot them in a 3D scatter plot.

Conclusion first model

For our first research question:
With what accuracy can we predict the quality of a red wine according to its chemical properties? We ended up implementing 4 models: linear regression, polynomial regression, Ridge regression and Lasso regression.
To be reminded of each of their performances, let's see the RMSE of each model:

Compared to our baseline RMSE of 0.81, each model has around a 20% lower RMSE, with Ridge regression scoring the lowest RMSE by a very small margin compared to the other implemented models.

In conclusion: we can predict the quality of a red wine with an RMSE of 0.63, which is a significant improvement over the baseline, but not quite the performance needed to predict quality with a small margin of error. We found that two chemical properties, chlorides and volatile_acidity, cause the biggest negative impact on the quality of a red wine. Sulphates and alcohol, on the other hand, seem to have a positive impact on the quality.

Modeling: research question 2.

Test and train data

For convenience, we'll add an overview of the combined dataset. Then we'll split the dataset into a test and a train portion. As a reminder: we'll be using the chemical properties as feature variables to try to determine whether a wine is red or white. The target variable is 'is_red', a nominal (boolean) variable with 2 possible values. We'll also standardize the feature variables, because we're using a model with euclidean distance measurements after all.

Baseline model

We'll start out by creating a simple model that always guesses the mode measurement of the target variable in the train dataset, and seeing what accuracy that gets, so that we have a score to attempt to improve upon.

The majority of the test dataset consists of white wine, so that's what we'll blindly be guessing with the baseline model.

The baseline model is already able to get an accuracy of +/- 77.21% on the test dataset.

We'll try to improve on this, starting with a simple model and moving towards more complex ones. We'll start out using KNearestNeighbors because of its simplicity, but because of the over-representation of white wine, we suspect other models might be able to do better.

Normally, one would build in business-knowledge to pick a few variables that might be relevant to start out with. In our case, we have no business expert, so we'll start with many attributes, and eliminate any unnecessary ones along the way.

Hyper-parameter-wise, we'll start by experimenting with different values for 'k'. Our classes are much bigger than this, but we'll cap the maximum sensible value for k at 100 for now, starting at 1, with only odd values being fair game because we have just 2 target classes.
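The k-sweep could be sketched like this (synthetic, white-heavy two-class data stands in for the standardized chemical features):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic 2-class data with a ~75/25 imbalance, mimicking white vs red wine.
X, y = make_classification(n_samples=600, n_features=11, n_informative=5,
                           weights=[0.75, 0.25], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize with training-set statistics only; KNN's distances are scale-sensitive.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Try every odd k from 1 to 99 and record the test accuracy.
accuracies = {}
for k in range(1, 100, 2):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    accuracies[k] = knn.score(X_test, y_test)

best_k = max(accuracies, key=accuracies.get)
print(best_k, round(accuracies[best_k], 3))
```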

It appears (k < 20 ∧ k > 0) is the most promising area, so let's have a look at the exact values.

We're getting above 99% accuracy, with the peak around 99.60%. Out of all the choices with peak accuracy, it's good to pick a higher setting for k to avoid overfitting. k = 9 has this top score for instance, and would probably be a good choice for production.

99.6% is an incredible score, which would mean the model is terribly useful. So incredible, in fact, that we should consider the possibility that something has gone wrong. We split our data into a training and a test set:

We removed duplicates that could end up in both the train and the test dataset from the dataset in data preparation.

Perhaps certain chemical values really do have big discernible gaps between red and white wine. We know that most attributes have normal or lognormal distributions in the red and white datasets individually, but there could be obvious differences between the corresponding distributions of the two datasets.

From this we can see that there are indeed some quite descriptive variables when it comes to predicting what color a wine is, which could explain the high accuracy.

We can divide the attributes into 3 categories based on these distributions (based on visual inspection, and no formal math or logic of any kind):

  1. Attributes that tell us a lot about the potential color of a wine.
    • Volatile Acidity
    • Chlorides
    • Total sulfur dioxide
  2. Attributes that tell us a little, or might tell us a lot about the potential color of a wine
    • Fixed acidity
    • Residual sugar
    • Free sulfur dioxide
    • Density
  3. Attributes that tell us close to nothing about the color of a wine
    • Citric acid
    • pH
    • Alcohol

We'll synthesize some new Train/Test feature datasets from the old ones, based on the first category, to train some new models with to see if a stripped down model can perform just as well.

We can see that a KNN-model based on only 3 attributes (volatile_acidity, chlorides and total_sulfur_dioxide) can still predict what color a wine is with 99.20% accuracy when k = 47 (in the test set), which could also be interesting for an eventual model in production.

Because 3 attributes turn out to be the most important, we can gain some insight by actually looking at the datapoints and their classifications in a 3D scatterplot, with the 3 most important attributes for discerning a wine's color on the axes.

Because we're dealing with so many Gaussian distributions, with distinctive differences between the target categories for some attributes, this problem could also be well suited for a Gaussian Naive Bayes Classifier.

Even though the model we have right now already performs quite well, it could be interesting to try this out in an attempt to reach 100% accuracy. We'll try a Gaussian Naive Bayes classifier with both all the chemical properties, and the stripped down chemical properties.
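A sketch of this classifier with synthetic stand-in data (the real notebook fits the wine features; the sketch also shows the per-class probabilities such a classifier exposes):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the combined wine dataset with an is_red-style target.
X, y = make_classification(n_samples=600, n_features=11, n_informative=5,
                           weights=[0.75, 0.25], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gnb = GaussianNB().fit(X_train, y_train)
accuracy = gnb.score(X_test, y_test)

# Unlike a bare accuracy score, the classifier also exposes class probabilities.
proba = gnb.predict_proba(X_test[:1])
print(round(accuracy, 3), proba.round(3))
```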

With 99.1% accuracy with all feature variables, and 97.8% with the stripped-down feature variable list, the Gaussian Naive Bayes classifier performs worse than K Nearest Neighbors in both scenarios, which is quite unexpected. A possible real-world advantage of this classifier is that it can return the probability of a given record belonging to either class, which could be useful.

Conclusion research question 2

With the use of a fairly simple algorithm like K Nearest Neighbors, one can easily classify the color of wine based on certain chemical properties with an accuracy above 99%, even when just looking at volatile acidity, chlorides and total sulfur dioxide. Simplicity seems to be key in this example, because a more complex model like a Gaussian Naive Bayes classifier performs ever so slightly worse.

Research question 3

Business understanding

For our third research question we have a dataset of wine reviews from different sommeliers, with information about the origin of the wine, its price and the review itself.
Our research question is: Can we distinguish between logical clusters of wineries? (Premium, budget, high-quality, etc...)

Data Understanding Wine-review dataset.

From our dataset's source ..., we have a list of the different attributes:

1 - country
2 - description
3 - designation
4 - points
5 - price
6 - province
7 - region_1
8 - region_2
9 - taster_name
10 - taster_twitter_handle
11 - title
12 - variety
13 - winery

The clearest way to explain what's contained in each of these columns is to look at them.

Let's load in the dataset and have a first look.

At first look it appears most of the (usable) variables are nominal, with points and price as the only numerical (discrete) values. Some columns initially seem fairly useless for the types of analysis we will most probably be doing for this project, like 'description', but we'll keep them in case we end up doing anything like a sentiment-analysis type model. It also appears we have a redundant index column called "Unnamed: 0".

Target and feature variables

We're not quite sure what feature variables we'll be using for the third question, but we know we'll be grouping by 'winery'. We'll start out by using 'price' and 'points' as further feature variables, but during the modelling stage we might end up using more.

Considering we're looking for logical clusters (unsupervised learning), there are no target variables.

Scales of measurement

As mentioned earlier, we're mostly dealing with categorical (nominal, specifically) variables in this dataset. There are 2 numerical variables: points and price, both discrete.

Central tendencies and dispersion measures

We can examine the spread of values of the numerical variables through histograms:

Points has an obvious Gaussian distribution. The price graph, however, is made quite unreadable by some outliers. We'll have a proper look at those later; let's ignore them for now to have a better look at the distribution.

It appears that the price column is lognormally distributed.

Outliers

Because we won't be comparing against a target, we'll use a separate way of creating boxplots to explore outliers in the numeric values.

It appears the median number of points is around 88, and the distribution is quite symmetric. There are a couple of outliers around 100, but nothing extreme.

The boxplot for price is barely a boxplot because of all the outliers. Like we already noticed with the histogram, most measurements fall within the 0-100 range, but there are some extremely high outliers. For data exploration, we'll create a separate column with the outliers removed for price.

Correlations

We'll use the Pearson correlation coefficient to see if there's a linear correlation between the only 2 numerical columns in the dataset. Because of the extreme outliers in price, it might be sensible to also check this with the outliers removed.

It appears there is a weak linear correlation between price and points when disregarding outliers. Because we're looking for clusters, this is not terribly relevant, but still noteworthy.

Data preparation: Review dataset

Having explored the data, we're ready to clean it up. We have already separated the outliers in a separate column in the dataset, because that couldn't wait until data preparation. First, let's get rid of the unnamed index column, because that's not useful in any way.

And let's change the categorical values from 'object', to 'string', to 'category'.

Modelling: Review dataset

Our goal is to find logical clusters for different types of wineries, using a clustering algorithm. First, we create a separate dataframe from the original dataframe grouped by winery, with the mean value of the point/price value of that winery's wine as columns.
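The grouping step could be sketched like this (the review rows are made up; the real frame has one row per review):

```python
import pandas as pd

# Made-up review rows standing in for the real dataset.
reviews = pd.DataFrame({
    'winery': ['Nicosia', 'Nicosia', 'Quinta dos Avidagos'],
    'points': [87, 85, 87],
    'price':  [15.0, 17.0, 15.0],
})

# One row per winery, with the mean points and price of its wines as columns.
by_winery = reviews.groupby('winery')[['points', 'price']].mean()
print(by_winery)
```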

For many clustering algorithms it's useful to have the data normalized as well. Let's do that in separate columns.

Let's have a look at the data:

So far, it mostly just looks like a thick cloud. Hopefully, a clustering algorithm will be able to see through the fog and give us more insight.

We'll start out by using kMeans on the (normalised) points and price because of its simplicity, and move on to more complex models and/or more data in case it doesn't give any useful results. Even though it's doubtful anything with k > 20 will be useful, we'll still try everything from k=2 to k=30 to see how the model behaves.
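The sweep could be sketched like this (a random point cloud stands in for the normalized winery data; the recorded inertia values feed the elbow plot):

```python
import numpy as np
from sklearn.cluster import KMeans

# Random normalized (points, price) cloud standing in for the winery data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 2))

# Fit kMeans for k = 2..30 and record each model's inertia for the elbow plot.
inertias = []
for k in range(2, 31):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

print(len(inertias), inertias[0] > inertias[-1])  # inertia shrinks as k grows
```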

Considering the fact there's no clear 'elbow' in the elbow plot, and no clear clusters in the previously constructed graph, it doesn't look very promising. We can look at a visualization at k=6 of the result to confirm:

As we expected, not very useful. It's doubtful another algorithm would be able to make sense of that cloud, so the way to go is probably to attempt to use extra dimensions to our advantage. There is one problem: there isn't much numerical data available to use in our models. All the categorical values in the dataset are nominal, so using them would mean a get_dummies() construction to turn them into useful data. Unfortunately, all the nominal variables have too many distinct values to realistically use, so we'll have to get creative.

An interesting metric could be the number of wine reviews listed for a single winery. We'll create a new grouped-by dataframe with this column added, normalize its columns, and visualize it.

Conclusion research question 3.

Visually, we can see there are still no sensible clusters to be found. There is no more numerical data, and creating even more from what we have would be fairly pointless. Unfortunately, we have to conclude that there's no useful insight to be gained from clustering our current data, as there just aren't many numbers to work with.